
Latent Field Discovery In Interacting Dynamical Systems With Neural Fields

Neural Information Processing Systems

Systems of interacting objects often evolve under the influence of field effects that govern their dynamics, yet previous works have abstracted away such effects, assuming that systems evolve in a vacuum. In this work, we focus on discovering these fields, inferring them from the observed dynamics alone rather than observing them directly.
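The abstract only sketches the idea, so as an illustrative aside (not the authors' implementation), a "field" here can be pictured as a small coordinate-to-vector network whose output perturbs each object's dynamics. The two-layer MLP and Euler integrator below are placeholder assumptions with random weights, just to make the setup concrete:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-layer MLP "neural field": maps a 2D location to a 2D
# force vector. In the paper's setting such a field would be inferred from
# observed trajectories; here the weights are random placeholders.
W1 = rng.normal(size=(2, 32)); b1 = np.zeros(32)
W2 = rng.normal(size=(32, 2)); b2 = np.zeros(2)

def field(x):
    """Evaluate the field at positions x, shape (N, 2) -> (N, 2)."""
    h = np.tanh(x @ W1 + b1)
    return h @ W2 + b2

def step(pos, vel, dt=0.01):
    """One Euler step: inter-object forces are omitted in this sketch; the
    field contributes an extra acceleration at each object's position."""
    acc = field(pos)
    return pos + dt * vel, vel + dt * acc

pos = rng.normal(size=(5, 2))
vel = np.zeros((5, 2))
pos, vel = step(pos, vel)
```

Fitting the field weights so that simulated trajectories match observed ones is what "discovering the field from dynamics alone" amounts to in this simplified picture.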


Toward Efficient and Robust Behavior Models for Multi-Agent Driving Simulation

Konstantinidis, Fabian, Sackmann, Moritz, Hofmann, Ulrich, Stiller, Christoph

arXiv.org Artificial Intelligence

Scalable multi-agent driving simulation requires behavior models that are both realistic and computationally efficient. We address this by optimizing the behavior model that controls individual traffic participants. To improve efficiency, we adopt an instance-centric scene representation, where each traffic participant and map element is modeled in its own local coordinate frame. This design enables efficient, viewpoint-invariant scene encoding and allows static map tokens to be reused across simulation steps. To model interactions, we employ a query-centric symmetric context encoder with relative positional encodings between local frames. We use Adversarial Inverse Reinforcement Learning to learn the behavior model and propose an adaptive reward transformation that automatically balances robustness and realism during training. Experiments demonstrate that our approach scales efficiently with the number of tokens, significantly reducing training and inference times, while outperforming several agent-centric baselines in terms of positional accuracy and robustness.
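The key property of the instance-centric design is that pairwise geometry between local frames is viewpoint-invariant. As a minimal sketch (an assumed, simplified version of the relative positional encoding, not the paper's exact formulation), the relative pose of token j in token i's frame can be computed and checked for invariance under a global rigid transform:

```python
import numpy as np

def relative_pose(p_i, th_i, p_j, th_j):
    """Pose of token j expressed in token i's local frame.

    Returns (dx, dy, dtheta): the translation rotated into i's frame and
    the wrapped heading difference. Rigidly transforming the whole scene
    leaves these quantities unchanged, which is what makes encodings built
    from them viewpoint-invariant."""
    c, s = np.cos(-th_i), np.sin(-th_i)
    d = np.asarray(p_j, dtype=float) - np.asarray(p_i, dtype=float)
    dx, dy = c * d[0] - s * d[1], s * d[0] + c * d[1]
    dtheta = (th_j - th_i + np.pi) % (2 * np.pi) - np.pi
    return dx, dy, dtheta

# Invariance check: apply one global rotation+translation to both poses.
a = relative_pose((1.0, 2.0), 0.3, (4.0, 6.0), 1.1)
phi, t = 0.7, np.array([5.0, -3.0])
R = np.array([[np.cos(phi), -np.sin(phi)], [np.sin(phi), np.cos(phi)]])
b = relative_pose(R @ np.array([1.0, 2.0]) + t, 0.3 + phi,
                  R @ np.array([4.0, 6.0]) + t, 1.1 + phi)
assert np.allclose(a, b)
```

Because static map tokens are encoded in their own frames, only these cheap relative quantities change between simulation steps, which is where the claimed reuse and efficiency come from.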


PPL: Point Cloud Supervised Proprioceptive Locomotion Reinforcement Learning for Legged Robots in Crawl Spaces

Ma, Bida, Xu, Nuo, Qi, Chenkun, Liu, Xin, Mo, Yule, Wang, Jinkai, Lu, Chunpeng

arXiv.org Artificial Intelligence

Legged locomotion in constrained spaces (called crawl spaces) is challenging. In crawl spaces, current proprioceptive locomotion learning methods struggle to achieve traversal because only ground features are inferred. In this study, a point cloud supervised RL framework for proprioceptive locomotion in crawl spaces is proposed. A state estimation network is designed to estimate the robot's collision states as well as the ground and spatial features needed for locomotion. A point cloud feature extraction method is proposed to supervise the state estimation network. The method represents the point cloud in a polar coordinate frame and uses MLPs for efficient feature extraction. Experiments demonstrate that, compared with existing methods, our method achieves faster iteration times in training and more agile locomotion in crawl spaces. This study enhances the ability of legged robots to traverse constrained spaces without requiring exteroceptive sensors. In recent years, legged robots have demonstrated remarkable terrain traversal capabilities, exhibiting significant application value.
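The abstract does not spell out the polar representation, so the following is a minimal stand-in (the paper's actual bin layout and feature design may differ): a 2D point cloud around the robot is reduced to a fixed-length vector of nearest-return distances per angular sector, which is the kind of compact input an MLP can consume efficiently:

```python
import numpy as np

def polar_features(points, n_bins=16, r_max=2.0):
    """Convert a 2D point cloud (N, 2) centered on the robot into a
    fixed-size polar histogram: the nearest return per angular sector,
    normalized to [0, 1] (illustrative sketch, not the paper's method)."""
    r = np.linalg.norm(points, axis=1)
    theta = np.arctan2(points[:, 1], points[:, 0])      # in (-pi, pi]
    bins = ((theta + np.pi) / (2 * np.pi) * n_bins).astype(int) % n_bins
    feat = np.full(n_bins, r_max)                       # "no return" = r_max
    for b, ri in zip(bins, r):
        feat[b] = min(feat[b], ri)
    return feat / r_max
```

A fixed-length, rotation-ordered vector like this gives the supervising network a stable target regardless of how many raw points the sensor returns.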


METIS: Multi-Source Egocentric Training for Integrated Dexterous Vision-Language-Action Model

Fu, Yankai, Chen, Ning, Zhao, Junkai, Shan, Shaozhe, Yao, Guocai, Wang, Pengwei, Wang, Zhongyuan, Zhang, Shanghang

arXiv.org Artificial Intelligence

Building a generalist robot that can perceive, reason, and act across diverse tasks remains an open challenge, especially for dexterous manipulation. A major bottleneck lies in the scarcity of large-scale, action-annotated data for dexterous skills, as teleoperation is difficult and costly. Human data, with its vast scale and diverse manipulation behaviors, provides rich priors for learning robotic actions. While prior works have explored leveraging human demonstrations, they are often constrained by limited scenarios and a large visual gap between humans and robots. To overcome these limitations, we propose METIS, a vision-language-action (VLA) model for dexterous manipulation pretrained on multi-source egocentric datasets. We first construct EgoAtlas, which integrates large-scale human and robotic data from multiple sources, all unified under a consistent action space. We further extract motion-aware dynamics, a compact and discretized motion representation, which provides efficient and expressive supervision for VLA training. Built upon these, METIS integrates reasoning and acting into a unified framework, enabling effective deployment to downstream dexterous manipulation tasks. Our method demonstrates exceptional dexterous manipulation capabilities, achieving the highest average success rate in six real-world tasks. Experimental results also highlight its superior generalization and robustness to out-of-distribution scenarios. These findings emphasize METIS as a promising step toward a generalist model for dexterous manipulation.
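The abstract leaves the discretization unspecified, so purely as a hypothetical stand-in for a "compact and discretized motion representation": continuous per-step motion deltas can be mapped to integer tokens (here by uniform per-dimension binning, which is an assumption, not the paper's scheme) so that a language-model-style head can predict them:

```python
import numpy as np

def discretize_motion(deltas, n_bins=256, lo=-1.0, hi=1.0):
    """Map continuous motion deltas to integer tokens in [0, n_bins) by
    uniform binning per dimension (hypothetical sketch)."""
    x = np.clip(np.asarray(deltas, dtype=float), lo, hi)
    return np.round((x - lo) / (hi - lo) * (n_bins - 1)).astype(int)

def dediscretize(tokens, n_bins=256, lo=-1.0, hi=1.0):
    """Recover approximate continuous deltas from tokens (bin centers)."""
    return lo + np.asarray(tokens, dtype=float) / (n_bins - 1) * (hi - lo)
```

The round-trip error is bounded by half a bin width, which is the usual trade-off when exchanging continuous action regression for token prediction.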


Unsupervised learning of object frames by dense equivariant image labelling

Neural Information Processing Systems

One of the key challenges of visual perception is to extract abstract models of 3D objects and object categories from visual measurements, which are affected by complex nuisance factors such as viewpoint, occlusion, motion, and deformations. Starting from the recent idea of viewpoint factorization, we propose a new approach that, given a large number of images of an object and no other supervision, can extract a dense object-centric coordinate frame. This coordinate frame is invariant to deformations of the images and comes with a dense equivariant labelling neural network that can map image pixels to their corresponding object coordinates. We demonstrate the applicability of this method to simple articulated objects and deformable objects such as human faces, learning embeddings from random synthetic transformations or optical flow correspondences, all without any manual supervision.
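The core constraint in this entry is equivariance: if a labelling network assigns each pixel the coordinates of the underlying object point, then its predictions must agree across any known warp of the image. As an illustrative sketch (integer warps only, and not the paper's exact objective), that constraint can be written as a loss:

```python
import numpy as np

def equivariance_loss(labels, warped_labels, warp):
    """Penalty for violating label equivariance under a known warp.

    `labels[u, v]` are object coordinates predicted for pixel (u, v) of an
    image; `warped_labels` are predictions for the warped image; `warp`
    maps a pixel of the original image to its location in the warped one.
    Zero loss means the two predictions agree everywhere."""
    total = 0.0
    h, w = labels.shape[:2]
    for u in range(h):
        for v in range(w):
            uu, vv = warp(u, v)
            total += np.sum((labels[u, v] - warped_labels[uu, vv]) ** 2)
    return total / (h * w)
```

Minimizing such a loss over random synthetic transformations or optical-flow correspondences is what lets the object-centric frame emerge without manual supervision.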





Expanded methods

Neural Information Processing Systems

The graphical model of DGP is summarized in Figure 1. First let's define the potential function. Now let's define the Gaussian bump. We will write everything in vector form hereafter. We want to "let the data speak" and avoid oversmoothing, so the penalty weights Given the approximate posterior (eq. To understand the various terms in the ELBO above it is helpful to start with a simpler special case.


A translation invariance

Neural Information Processing Systems

In 2 dimensions, we use eq.

Simplified rotations. In 2 dimensions, the computations can be simplified since rotations commute. Thus, we wrap the computed angle difference so that it always belongs to that range. Furthermore, in all cases where angles are not used geometrically (e.g. for rotations), we

In 3 dimensions, the computation of rotation matrices is more involved than in the 2D case. As explained in section 2.1, input trajectories are described by the states. In the following equations, we remove time indices to reduce clutter.
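The angle-wrapping step mentioned above is a standard trick: an angle difference such as 350° vs 10° should read as 20°, not 340°. A minimal sketch of the wrap into (-π, π] (assuming that is "that range" referred to in the truncated text):

```python
import math

def wrap_angle(a):
    """Wrap an angle (difference) into [-pi, pi].

    atan2(sin a, cos a) discards whole turns while preserving the signed
    direction of the residual rotation."""
    return math.atan2(math.sin(a), math.cos(a))
```

Using the wrapped value keeps 2D rotation comparisons continuous near the ±π boundary, which is why commuting 2D rotations can be handled with plain angle arithmetic.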